
    Optimization Scheme of Joint Noise Suppression and Dereverberation Based on Higher-Order Statistics

    APSIPA ASC 2012: Asia-Pacific Signal and Information Processing Association 2012 Annual Summit and Conference, December 3-6, 2012, Hollywood, California, USA. In this paper, we apply a higher-order-statistics parameter to automatically improve the performance of blind speech enhancement. Recently, a method that suppresses both diffuse background noise and the late reverberation part of speech has been proposed, combining blind signal extraction and Wiener filtering. However, this method requires a good strategy for choosing its parameter set in order to achieve the optimum result and to control the amount of musical noise, a common problem in non-linear signal processing. We present an optimization scheme that controls the Wiener filter coefficients used in this method according to the amount of musical noise generated, as measured by higher-order statistics. The noise reduction rate and cepstral distortion are also evaluated to confirm the effectiveness of this scheme.
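    The higher-order-statistics measure used for musical noise in this line of work is typically the kurtosis ratio of the residual noise before and after processing. A minimal sketch of that idea (the function names and the crude spectral-subtraction-like suppression are illustrative assumptions, not the authors' exact method):

    ```python
    import numpy as np

    def kurtosis(x):
        """Fourth moment over squared second moment (no mean removal,
        as is common for non-negative power-spectral coefficients)."""
        m2 = np.mean(x ** 2)
        return np.mean(x ** 4) / (m2 ** 2)

    def kurtosis_ratio(noise_before, noise_after):
        """Musical-noise proxy: how much spikier (more super-Gaussian)
        the residual noise became after non-linear processing."""
        return kurtosis(noise_after) / kurtosis(noise_before)

    rng = np.random.default_rng(0)
    noise = np.abs(rng.normal(size=16384))         # stand-in noise magnitudes
    # Crude non-linear suppression: subtract an estimate, floor the rest.
    # Isolated surviving peaks are what listeners hear as musical noise.
    processed = np.maximum(noise - 2.0, 0.05 * noise)
    print(kurtosis_ratio(noise, processed) > 1.0)  # ratio grows with spikiness
    ```

    A larger kurtosis ratio indicates more isolated spectral peaks, which is the quantity the proposed scheme would drive the Wiener coefficients to keep small.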

    Consonant Recognition by Modular Construction of Large Phonemic Time-Delay Neural Networks

    Abstract: In this paper we show that neural networks for speech recognition can be constructed in a modular fashion by exploiting the hidden structure of previously trained phonetic subcategory networks. The performance of the resulting larger phonetic nets was found to be as good as the performance of the subcomponent nets by themselves. This approach avoids the excessive learning times that would be necessary to train larger networks and allows for incremental learning. Large time-delay neural networks constructed incrementally by applying these modular training techniques achieved a recognition performance of 96.0% for all consonants and 94.7% for all phonemes.

    Introduction: Recently we have demonstrated that connectionist architectures capable of capturing some critical aspects of the dynamic nature of speech can achieve superior recognition performance for difficult but small phonemic discrimination tasks, such as discrimination of the voiced consonants B, D and G [1, 2]. Encouraged by these results, we wanted to explore the question of how we might expand on these models to make them useful for the design of speech recognition systems. A problem that emerges as we attempt to apply neural network models to the full speech recognition problem is the problem of scaling. Simply extending neural networks to ever larger structures and retraining them as one monolithic net quickly exceeds the capabilities of the fastest and largest supercomputers. The search complexity of finding good solutions in a huge space of possible network configurations also soon assumes unmanageable proportions. Moreover, having to decide on all possible classes for recognition ahead of time, as well as collecting sufficient data to train such a large monolithic network, is impractical to say the least.

    In an effort to extend our models from small recognition tasks to large-scale speech recognition systems, we must therefore explore modularity and incremental learning as design strategies to break up a large learning task into smaller subtasks. Breaking up a large task into subtasks to be tackled by individual black boxes interconnected in ad hoc arrangements, on the other hand, would mean abandoning one of the most attractive aspects of connectionism: the ability to perform complex constraint satisfaction in a massively parallel and interconnected fashion, in view of an overall optimal performance goal. In this paper we demonstrate, based on a set of experiments aimed at phoneme recognition, that it is indeed possible to construct large neural networks incrementally by exploiting the hidden structure of smaller pretrained subcomponent networks.

    Small Phonemic Classes by Time-Delay Neural Networks: In our previous work, we have proposed a Time-Delay Neural Network (TDNN) architecture. Its multilayer architecture, its shift-invariance and the time-delayed connections of its units all contributed to its performance by allowing the net to develop complex, non-linear decision surfaces and insensitivity to misalignments, and by incorporating contextual information into decision making (see [1, 2] for detailed analysis and discussion). It is trained by the back-propagation procedure [3] using shared weights for different time-shifted positions of the net [1, 2]. In spirit it has similarities to other models recently proposed [4, 5]. This network, however, had only been trained for the voiced stops B, D, G, and we began our extensions by training similar networks for the other phonemic classes in our database. All phoneme tokens in our experiments were extracted using phonetic hand-labels from a large-vocabulary database of 5240 common Japanese words. Each word in the database was spoken in isolation by one male native Japanese speaker. All utterances were recorded in a soundproof booth and digitized at a 12 kHz sampling rate. The database was then split into a training set and a testing set of 2620 utterances each. A 150 msec range around a phoneme boundary was excised for each phoneme token.
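    The shared-weight, time-delayed connections described above amount to applying the same weight matrix at every time-shifted window of the input, i.e. a 1-D convolution over the time axis. A minimal forward-pass sketch (the layer sizes are illustrative assumptions, not those of the original network):

    ```python
    import numpy as np

    def tdnn_layer(x, weights, bias):
        """One TDNN layer: the same weight tensor is applied at every
        time-shifted window (weight sharing across time).
        x: (T, F) time-frequency input; weights: (D, F, H); bias: (H,)
        where D is the delay-window length and H the hidden units."""
        T, F = x.shape
        D, _, H = weights.shape
        out = np.empty((T - D + 1, H))
        for t in range(T - D + 1):
            window = x[t:t + D]                  # (D, F) time window
            out[t] = np.tensordot(window, weights, axes=([0, 1], [0, 1])) + bias
        return np.tanh(out)                      # squashing nonlinearity

    rng = np.random.default_rng(0)
    spectrogram = rng.normal(size=(15, 16))      # 15 frames, 16 spectral coefficients
    w1 = 0.1 * rng.normal(size=(3, 16, 8))       # 3-frame delay window, 8 hidden units
    h = tdnn_layer(spectrogram, w1, np.zeros(8))
    print(h.shape)                               # (13, 8): one output per window position
    ```

    Because the weights are shared across window positions, the layer's response to a phonetic feature is the same wherever the feature occurs in time, which is the shift-invariance the abstract credits for the network's robustness to misalignment.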

    Response Generation based on Statistical Machine Translation for Speech-Oriented Guidance System

    Abstract: Example-based response generation is a robust and practical approach for a real-environment information guidance system. However, this framework cannot reflect differences in nuance, because the set of answer sentences is fixed beforehand. To overcome this issue, we have proposed response generation using a statistical machine translation technique. In this paper, we make use of N-best speech recognition candidates instead of the manual transcriptions used in our previous study. As a result, the generation rate of appropriate response sentences was improved by using multiple recognition hypotheses.

    Silent-speech enhancement using body-conducted vocal-tract resonance signals

    The physical characteristics of weak body-conducted vocal-tract resonance signals called non-audible murmur (NAM) and the acoustic characteristics of three sensors developed for detecting these signals have been investigated. NAM signals are attenuated by 50 dB at 1 kHz; this attenuation consists of 30-dB full-range attenuation due to air-to-body transmission loss and a 10 dB/octave spectral decay due to sound propagation loss within the body. These characteristics agree with the spectral characteristics of measured NAM signals. The sensors have a sensitivity of between 41 and 58 dB [V/Pa] at 1 kHz, and the mean signal-to-noise ratio of the detected signals was 15 dB. On the basis of these investigations, three types of silent-speech enhancement systems were developed: (1) simple, direct amplification of weak vocal-tract resonance signals using a wired urethane-elastomer NAM microphone, (2) simple, direct amplification using a wireless urethane-elastomer-duplex NAM microphone, and (3) transformation of the weak vocal-tract resonance signals sensed by a soft-silicone NAM microphone into whispered speech using statistical conversion. Field testing of the systems showed that they enable voice-impaired people to communicate verbally using body-conducted vocal-tract resonance signals. Listening tests demonstrated that weak body-conducted vocal-tract resonance sounds can be transformed into intelligible whispered speech sounds. Using these systems, people with voice impairments can re-acquire speech communication with less effort. (C) 2009 Elsevier B.V. All rights reserved. Speech Communication, 52(4):301-313 (2010), journal article.
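    The attenuation figures quoted above imply a simple frequency-dependent loss model. A sketch of that model follows; the 250 Hz corner frequency is an assumption inferred from the stated numbers (30 dB flat loss plus 10 dB/octave decay reaching 50 dB at 1 kHz implies two octaves of decay, hence a corner near 250 Hz):

    ```python
    import math

    def nam_attenuation_db(freq_hz, corner_hz=250.0):
        """Body-conduction loss model for NAM signals: 30 dB full-range
        (air-to-body) loss plus 10 dB/octave propagation decay above an
        assumed 250 Hz corner frequency."""
        decay = 10.0 * max(0.0, math.log2(freq_hz / corner_hz))
        return 30.0 + decay

    print(nam_attenuation_db(1000.0))   # 50.0 dB, matching the figure quoted above
    ```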

    Interactive controller for audio object localization based on spatial representative vector operation

    Abstract: In this paper, we propose a new interactive controller for audio object localization based on spatial representative vector operations on a stereo mixed source. First, we developed the interactive controller, which is equipped with a capacitive touchscreen panel so that the listener can intuitively operate audio objects displayed on the panel with a touch pen. Next, we assessed the perceptual effects on the localization and sound quality of an audio object after individual operations, to verify the operation of the interactive controller via a subjective evaluation. The results of the experiments clarify that the interactive controller enables the listener to change the gain and the localization of audio objects without sound degradation, provided the gain operation is not extreme.

    Semi-blind suppression of internal noise for hands-free robot spoken dialog system

    Abstract: The speech enhancement architecture presented in this paper is specifically developed for hands-free robot spoken dialog systems. It is designed to take advantage of additional sensors installed inside the robot to record the internal noises. First, a modified frequency-domain blind signal separation (FD-BSS) gives estimates of the noises generated outside and inside the robot. Then these noises are canceled from the acquired speech by a multichannel Wiener post-filter. Experimental results show the recognition improvement for a dictation task in the presence of both diffuse background noise and internal noises.
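    The Wiener post-filtering step described above can be sketched in the frequency domain. This is a single-channel simplification under the assumption that the noise power estimates already come from the BSS stage; the variable names and the gain floor are illustrative, not the paper's multichannel formulation:

    ```python
    import numpy as np

    def wiener_postfilter(noisy_psd, noise_psd, floor=0.05):
        """Per-frequency Wiener gain: estimated speech-to-total power ratio,
        floored to limit the musical noise caused by over-suppression."""
        speech_psd = np.maximum(noisy_psd - noise_psd, 0.0)
        gain = speech_psd / np.maximum(noisy_psd, 1e-12)
        return np.maximum(gain, floor)

    rng = np.random.default_rng(0)
    noise_psd = np.full(257, 1.0)                 # estimated internal + diffuse noise power
    speech_psd = np.abs(rng.normal(size=257)) ** 2 * 4.0
    noisy_psd = speech_psd + noise_psd            # observed power spectrum
    g = wiener_postfilter(noisy_psd, noise_psd)
    print(g.min() >= 0.05 and g.max() <= 1.0)     # True: gains stay in [floor, 1]
    ```

    The gain floor is the kind of parameter the musical-noise discussion in the first entry above is concerned with: a lower floor suppresses more noise but leaves spikier, more audible residual artifacts.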